Adding promotion for UnknownType per V3+ spec#2155
Adding promotion for UnknownType per V3+ spec#2155ldsantos0911 wants to merge 2 commits intoapache:mainfrom
Conversation
|
Thanks for working on this! The changes LGTM. Tagging @Fokko for another pair of eyes :) |
kevinjqliu
left a comment
There was a problem hiding this comment.
I think there are 2 different issues here.
For V3 tables, yes, I think we can map pyarrow's null type to UnknownType
For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.
The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table
import pyarrow as pa
# table created with the below pyarrow schema
schema = pa.schema(
[
pa.field("col1", pa.string(), nullable=True),
]
)
df = pa.Table.from_pylist(
[
{"col1": None}
],
schema=schema
)
df.schema
Thank you for that additional context regarding V1/V2; I believe that makes sense. As far as V3 tables go, it looks like |
|
|
||
| @promote.register(UnknownType) | ||
| def _(file_type: UnknownType, read_type: IcebergType) -> IcebergType: | ||
| return read_type # Per V3 Spec, "Unknown" can be promoted to any type |
There was a problem hiding this comment.
I interpreted this as promotions available for a primitive type rather than promotions to a primitive type. i.e. UnknownType is a primitive type and these are the types it can be promoted to. It should be easy enough to change, so I'm fine either way. But, to me it would make sense that if a {List,Map,Struct}Type field is nullable and a null PyArrow field is being evaluated for compatibility, that evaluation should succeed. What are your thoughts?
There was a problem hiding this comment.
I think we need to extend this logic a bit:
| return read_type # Per V3 Spec, "Unknown" can be promoted to any type | |
| # Per V3 Spec, "Unknown" can be promoted to any Primitive type | |
| if isinstance(read_type, PrimitiveType): | |
| return file_type | |
| else: | |
| raise ResolveError(f"Cannot promote {file_type} to {read_type}") |
|
|
||
| def test_schema_conversion_null_column(table_schema_simple: Schema) -> None: | ||
| pyarrow_schema: pa.Schema = schema_to_pyarrow(table_schema_simple) | ||
| new_field = pyarrow_schema.field(2).with_type(pa.null()) # Make the optional boolean field null for testing |
There was a problem hiding this comment.
How about creating a new schema with just a field with a NestedType instead? I don't think we want to re-assign fields as that makes the test harder te read.
There was a problem hiding this comment.
Sure I can make a new schema. I think I was just trying to get it working and forgot to clean this up 😀
|
@ldsantos0911 Thanks for working on this, left a few comments, but this PR makes a lot of sense to me 👍 |

Rationale for this change
When attempting to write a PyArrow dataframe that has a
nullfield corresponding to a nullable Table field, I see this error:Per the V3 spec,
UnknownType(aspa.null()is cast) should be promotable to any type. This change attempts to unblock situations where a DataFrame may end up having a null type for an optional field while also incorporating the V3 spec.Note: This issue is related but was based on an older version. Nonetheless, the underlying situation is still blocked.
Are these changes tested?
Yes. Added a couple of relevant unit tests. Additionally, I did a manual sanity check:
Are there any user-facing changes?
Yes, this will allow UnknownType promotion.